Skip to content

Conversation

@geropl
Copy link
Member

@geropl geropl commented Feb 7, 2022

Description

This configures a rate-limit for API calls to startWorkspace of 1 / 10s. The dashboard handles the error and waits until retryAfter to re-try the call (currently indefinitely).

Note that this is enforced per server instance at the moment, as the rate-limiter state is not shared between server instances. But it's a first step that already reduces the impact of future incidents.

Related Issue(s)

Context: #8043

How to test

Release Notes

configure basic rate-limiting for `startWorkspace`

Documentation

@geropl geropl requested a review from a team February 7, 2022 16:59
@github-actions github-actions bot added the team: webapp Issue belongs to the WebApp team label Feb 7, 2022
@codecov
Copy link

codecov bot commented Feb 7, 2022

Codecov Report

Merging #8073 (056ead0) into main (29c3a7d) will decrease coverage by 1.15%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #8073      +/-   ##
==========================================
- Coverage   12.01%   10.86%   -1.16%     
==========================================
  Files          20       18       -2     
  Lines        1190     1022     -168     
==========================================
- Hits          143      111      -32     
+ Misses       1043      909     -134     
+ Partials        4        2       -2     
Flag Coverage Δ
components-gitpod-cli-app 10.86% <ø> (ø)
components-local-app-app-darwin-amd64 ?
components-local-app-app-darwin-arm64 ?
components-local-app-app-linux-amd64 ?
components-local-app-app-linux-arm64 ?
components-local-app-app-windows-386 ?
components-local-app-app-windows-amd64 ?
components-local-app-app-windows-arm64 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
components/local-app/pkg/auth/auth.go
components/local-app/pkg/auth/pkce.go

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 29c3a7d...056ead0. Read the comment docs.

Copy link
Contributor

@jankeromnes jankeromnes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, so cool to see the rate limiter code actually working! 👀 (Last time we tried it with Team Plan methods, I think it didn't behave as expected at all and we had to revert it again.)

I guess this makes our "infinite-redirect on workspace fast crash" failure mode a little better (i.e. we make the self-DDOS 10x less severe), and the code looks good to me. 👍

I'm just curious if this can have any impact in normal situations.

  • For one, I guess it will impact the use case "have 5 stopped workspaces and restart them all by clicking all the tabs" (I guess it will just make their restart [0-4] * 10s slower, but maybe that's okay)
  • Other impact can probably be tested on staging once merged. I expect no blockers, but if starting workspaces on staging becomes unreliable, we might have to revert again before deploying (that's what happened with our Team Plan calls rate-limiting attempt last time)

In any case, this looks good for merging & further testing on staging! 🚀

Comment on lines +39 to +42
startWorkspace: {
points: 1, // 1 workspace start per user per 10s
durationsSec: 10
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}
log.warn({ userId }, "Rate limiter prevents accessing method due to too many requests.", rlRejected, { method });
throw new ResponseError(ErrorCodes.TOO_MANY_REQUESTS, "too many requests", { "Retry-After": String(Math.round(rlRejected.msBeforeNext / 1000)) || 1 });
throw new ResponseError<RateLimiterError>(ErrorCodes.TOO_MANY_REQUESTS, "too many requests", { method, retryAfter: Math.round(rlRejected.msBeforeNext / 1000) || 1 });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably safer with a Math.ceil, but not a major issue (since the front-end retry code works well):

Suggested change
throw new ResponseError<RateLimiterError>(ErrorCodes.TOO_MANY_REQUESTS, "too many requests", { method, retryAfter: Math.round(rlRejected.msBeforeNext / 1000) || 1 });
throw new ResponseError<RateLimiterError>(ErrorCodes.TOO_MANY_REQUESTS, "too many requests", { method, retryAfter: Math.ceil(rlRejected.msBeforeNext / 1000) || 1 });

@roboquat roboquat merged commit d955ce1 into main Feb 7, 2022
@roboquat roboquat deleted the gpl/8043-redirect-loop branch February 7, 2022 17:54
@geropl
Copy link
Member Author

geropl commented Feb 8, 2022

@jankeromnes Thx for the thorough review! 🙏

I'm just curious if this can have any impact in normal situations.

Indeed. This is first and foremost a safety measure to help with the next days. Next on the list:

  • see if we can get the "number of workspaces started by the user recently" from the DB, and maybe show a modal if it's too excessive
  • fix the underlying issue: requires a re-deploy of workspace cluster (ws-proxy)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-note size/L team: webapp Issue belongs to the WebApp team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants